{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "%%shell\n",
        "# Installs the latest dev build of TVM from PyPI. If you wish to build\n",
        "# from source, see https://tvm.apache.org/docs/install/from_source.html\n",
        "pip install apache-tvm --pre"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# 基于 VIDIA GPU 的神经网络自动调度\n",
        "**原作者**: [Lianmin Zheng](https://github.com/merrymercy)\n",
        "\n",
        "针对特定设备和工作负载的自动调优对于获得最佳性能至关重要。下面是关于如何用自动调度器调优 NVIDIA GPU 的整个神经网络的教程。\n",
        "\n",
        "为了自动调优神经网络，需要将网络划分为小的子图，并独立地调优它们。每个子图被视为一个搜索任务。任务调度程序对时间进行切片，并动态地为这些任务分配时间资源。任务调度器预测每个任务对端到端执行时间的影响，并优先考虑能够最大程度减少执行时间的任务。\n",
        "\n",
        "对于每个子图，使用 `tvm/python/topi` 中的 compute 声明来获得张量表达式形式的计算 DAG。然后，使用自动调度器来构造 DAG 的搜索空间，并搜索良好的调度（低级优化）。\n",
        "\n",
        "与基于模板的 [autotvm](../tune_with_autotvm/index) 依赖手动模板定义搜索空间不同，自动调度程序不需要任何调度模板。换句话说，自动调度器只使用 `tvm/python/topi` 中的 compute，而不使用现有的调度模板。"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import numpy as np\n",
        "\n",
        "import tvm\n",
        "from tvm import relay, auto_scheduler\n",
        "import tvm.relay.testing\n",
        "from tvm.contrib import graph_executor"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 定义网络\n",
        "\n",
        "首先，需要用 relay 前端 AP I定义网络。可以从 {mod}`tvm.relay.testing` 加载一些预定义的网络。还可以从 MXNet、ONNX、PyTorch 和 TensorFlow 加载模型（参见[前端教程](../compile_models/index)）。\n",
        "\n",
        "对于卷积神经网络，尽管自动调度器可以在任何布局下正确工作，但我们发现 NHWC 布局通常能获得最佳性能。我们还通过自动调度器实现了对 NHWC 布局的更多优化。因此，建议将模型转换为 NHWC 布局以使用自动调度器。可以使用 [ConvertLayout pass](convert-layout-usage) 在 TVM 中进行布局转换。"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "def get_network(name, batch_size, layout=\"NHWC\", dtype=\"float32\"):\n",
        "    \"\"\"Get the symbol definition and random weight of a network\"\"\"\n",
        "\n",
        "    # auto-scheduler prefers NHWC layout\n",
        "    if layout == \"NHWC\":\n",
        "        image_shape = (224, 224, 3)\n",
        "    elif layout == \"NCHW\":\n",
        "        image_shape = (3, 224, 224)\n",
        "    else:\n",
        "        raise ValueError(\"Invalid layout: \" + layout)\n",
        "\n",
        "    input_shape = (batch_size,) + image_shape\n",
        "    output_shape = (batch_size, 1000)\n",
        "\n",
        "    if name.startswith(\"resnet-\"):\n",
        "        n_layer = int(name.split(\"-\")[1])\n",
        "        mod, params = relay.testing.resnet.get_workload(\n",
        "            num_layers=n_layer,\n",
        "            batch_size=batch_size,\n",
        "            layout=layout,\n",
        "            dtype=dtype,\n",
        "            image_shape=image_shape,\n",
        "        )\n",
        "    elif name.startswith(\"resnet3d-\"):\n",
        "        n_layer = int(name.split(\"-\")[1])\n",
        "        mod, params = relay.testing.resnet.get_workload(\n",
        "            num_layers=n_layer,\n",
        "            batch_size=batch_size,\n",
        "            layout=layout,\n",
        "            dtype=dtype,\n",
        "            image_shape=image_shape,\n",
        "        )\n",
        "    elif name == \"mobilenet\":\n",
        "        mod, params = relay.testing.mobilenet.get_workload(\n",
        "            batch_size=batch_size, layout=layout, dtype=dtype, image_shape=image_shape\n",
        "        )\n",
        "    elif name == \"squeezenet_v1.1\":\n",
        "        assert layout == \"NCHW\", \"squeezenet_v1.1 only supports NCHW layout\"\n",
        "        mod, params = relay.testing.squeezenet.get_workload(\n",
        "            version=\"1.1\",\n",
        "            batch_size=batch_size,\n",
        "            dtype=dtype,\n",
        "            image_shape=image_shape,\n",
        "        )\n",
        "    elif name == \"inception_v3\":\n",
        "        input_shape = (batch_size, 3, 299, 299) if layout == \"NCHW\" else (batch_size, 299, 299, 3)\n",
        "        mod, params = relay.testing.inception_v3.get_workload(batch_size=batch_size, dtype=dtype)\n",
        "    elif name == \"mxnet\":\n",
        "        # an example for mxnet model\n",
        "        from mxnet.gluon.model_zoo.vision import get_model\n",
        "\n",
        "        assert layout == \"NCHW\"\n",
        "\n",
        "        block = get_model(\"resnet18_v1\", pretrained=True)\n",
        "        mod, params = relay.frontend.from_mxnet(block, shape={\"data\": input_shape}, dtype=dtype)\n",
        "        net = mod[\"main\"]\n",
        "        net = relay.Function(\n",
        "            net.params, relay.nn.softmax(net.body), None, net.type_params, net.attrs\n",
        "        )\n",
        "        mod = tvm.IRModule.from_expr(net)\n",
        "\n",
        "    return mod, params, input_shape, output_shape\n",
        "\n",
        "\n",
        "# Define the neural network and compilation target\n",
        "network = \"resnet-18\"\n",
        "batch_size = 1\n",
        "layout = \"NHWC\"\n",
        "target = tvm.target.Target(\"cuda\")\n",
        "dtype = \"float32\"\n",
        "log_file = \"%s-%s-B%d-%s.json\" % (network, layout, batch_size, target.kind.name)"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 提取搜索任务\n",
        "\n",
        "接下来，从网络中提取搜索任务及其权重。任务的权重是该任务的子图在整个网络中出现的次数。通过使用权重，可以将网络的端到端延迟近似为 `sum(latency[t] * weight[t])`，其中 `latency[t]` 是任务的延迟，`weight[t]` 是任务的权重。任务调度器会优化这个目标。"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# 从网络中提取任务\n",
        "print(\"Extract tasks...\")\n",
        "mod, params, input_shape, output_shape = get_network(network, batch_size, layout, dtype=dtype)\n",
        "tasks, task_weights = auto_scheduler.extract_tasks(mod[\"main\"], params, target)\n",
        "\n",
        "for idx, task in enumerate(tasks):\n",
        "    print(\"========== Task %d  (workload key: %s) ==========\" % (idx, task.workload_key))\n",
        "    print(task.compute_dag)"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 开始调优\n",
        "\n",
        "现在，设置了一些调优选项并启动搜索任务\n",
        "\n",
        "- `measure_ctx` 启动不同的测量进程以提供隔离。它可以在测量期间保护主进程不受 GPU 崩溃的影响，并避免其他运行时冲突。\n",
        "- `min_repeat_ms` 定义每次测量中一次“重复”的最小持续时间。这可以预热 GPU，这对于获得准确的测量结果是必要的。通常，建议值 >= 300 ms。\n",
        "- `num_measure_trials` 是在调优期间可以使用的度量试验的数量。可以将它设置为小的数字（例如，200）以进行快速演示运行。在实践中，建议将它设置在 `900 * len(tasks)` 左右，这通常足以让搜索收敛。例如 resnet-18 中有 24 个任务，所以可以将其设置为 20000。可以根据自己的时间预算调整该参数。\n",
        "- 此外，使用 `RecordToFile` 将测量记录转储到日志文件中，测量记录可以用于查询历史，恢复搜索，并在以后进行更多的分析。\n",
        "- 查阅 {mod}`tvm.auto_scheduler.TuningOptions`、{mod}`tvm.auto_scheduler.LocalRPCMeasureContext` 获取更多参数。\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "def run_tuning():\n",
        "    print(\"开始调优...\")\n",
        "    measure_ctx = auto_scheduler.LocalRPCMeasureContext(repeat=1, min_repeat_ms=300, timeout=10)\n",
        "\n",
        "    tuner = auto_scheduler.TaskScheduler(tasks, task_weights)\n",
        "    tune_option = auto_scheduler.TuningOptions(\n",
        "        num_measure_trials=200,  # change this to 20000 to achieve the best performance\n",
        "        runner=measure_ctx.runner,\n",
        "        measure_callbacks=[auto_scheduler.RecordToFile(log_file)],\n",
        "    )\n",
        "    tuner.tune(tune_option)\n",
        "\n",
        "run_tuning()"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "````{admonition} 解释在调优期间的打印信息\n",
        ":class: alert alert-info\n",
        "在调优过程中，控制台上将打印大量信息。它们用于调试目的。最重要的信息是任务调度器的输出。下表是示例输出。\n",
        "\n",
        "```c\n",
        "----------------------------------------------------------------------\n",
        "------------------------------  [ Task Scheduler ]\n",
        "----------------------------------------------------------------------\n",
        "|  ID  | Latency (ms) | Speed (GFLOPS) | Trials |\n",
        "-------------------------------------------------\n",
        "|    0 |        0.005 |           0.88 |     64 |\n",
        "|    1 |        0.010 |          99.10 |     64 |\n",
        "|    2 |        0.006 |           0.00 |     64 |\n",
        "|    3 |        0.145 |         979.78 |    384 |\n",
        "|    4 |        0.130 |        1097.02 |    384 |\n",
        "|    5 |        0.143 |         992.69 |    384 |\n",
        "|    6 |        0.076 |        1526.86 |    192 |\n",
        "|    7 |        0.115 |         999.44 |    320 |\n",
        "|    8 |        0.079 |        1449.39 |    320 |\n",
        "|    9 |        0.122 |         938.73 |    384 |\n",
        "|   10 |        0.063 |        1832.98 |    192 |\n",
        "|   11 |        0.072 |        1763.62 |    256 |\n",
        "|   12 |        0.062 |        2036.40 |    192 |\n",
        "|   13 |        0.068 |        1874.44 |    192 |\n",
        "|   14 |        0.049 |        2346.50 |    128 |\n",
        "|   15 |        0.076 |        1694.31 |    256 |\n",
        "|   16 |        0.067 |        1933.30 |    448 |\n",
        "|   17 |        0.076 |        1680.90 |    256 |\n",
        "|   18 |        0.022 |          98.43 |     64 |\n",
        "|   19 |        0.076 |        3112.55 |    192 |\n",
        "|   20 |        0.013 |        2026.44 |     64 |\n",
        "|   21 |        0.011 |        1136.69 |     64 |\n",
        "|   22 |        0.013 |         992.47 |     64 |\n",
        "|   23 |        0.020 |         627.56 |     64 |\n",
        "-------------------------------------------------\n",
        "Estimated total latency: 1.587 ms  Trials: 4992  Used time : 13296 s  Next ID: 3\n",
        "```\n",
        "\n",
        "该表列出了所有任务的延迟和（估计的）速度。它还列出了所有任务的测量试验分配。最后一行打印这些任务的总加权延迟，这可以粗略估计网络的端到端执行时间。最后一行还输出测试的总数、自动调优所花费的总时间和下一个要调优的任务的 id。\n",
        "\n",
        "也会有一些 \"tvm::Error\" 和 CUDA 的错误，因为自动调度程序将尝试一些无效的调度。如果可以继续进行调优，则可以安全地忽略它们，因为这些错误与主进程隔离开来。\n",
        "````"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n",
        "```{admonition} 提前终止调优\n",
        ":class: alert alert-info\n",
        "可以通过强制终止此进程提前终止调优。只要为日志文件中的每个任务获得至少一个有效的调度，就应该能够进行编译（见下面的部分）。\n",
        "```\n"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 编译和评估\n",
        "\n",
        "在自动调优之后，可以用找到的最佳调度来编译网络。在自动调优期间，所有测量记录都被转储到日志文件中，因此可以读取日志文件并加载最佳调度。"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Compile with the history best\n",
        "print(\"Compile...\")\n",
        "with auto_scheduler.ApplyHistoryBest(log_file):\n",
        "    with tvm.transform.PassContext(opt_level=3, config={\"relay.backend.use_auto_scheduler\": True}):\n",
        "        lib = relay.build(mod, target=target, params=params)\n",
        "\n",
        "# Create graph executor\n",
        "dev = tvm.device(str(target), 0)\n",
        "module = graph_executor.GraphModule(lib[\"default\"](dev))\n",
        "data_tvm = tvm.nd.array((np.random.uniform(size=input_shape)).astype(dtype))\n",
        "module.set_input(\"data\", data_tvm)\n",
        "\n",
        "# Evaluate\n",
        "print(\"Evaluate inference time cost...\")\n",
        "print(module.benchmark(dev, repeat=3, min_repeat_ms=500))"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 其他技巧\n",
        "\n",
        "1. 在调优过程中，自动调度器需要编译许多程序并从中提取特征。该部分是 CPU 密集型的，因此建议使用多核的高性能 CPU，以加快搜索速度。\n",
        "2. 可以使用 `python3 -m tvm.auto_scheduler.measure_record --mode distill -i log.json` 提取大的日志文件，只保存最好的有用的记录。\n",
        "3. 可以从上一个日志文件恢复搜索。在函数 `run_tuning` 中创建任务调度器时，只需要添加新的参数 `load_log_file`，即 `tuner = auto_scheduler.TaskScheduler(tasks, task_weights, load_log_file=log_file)`。\n",
        "4. 如果有多个目标 CPU，您可以将它们都用于测量，从而使测量并行化。请查阅[如何使用 RPC 跟踪器和 RPC 服务器](tutorials-autotvm-scale-up-rpc-tracker)。要在自动调度器中使用 RPC 跟踪器，请将 `TuningOptions` 中的运行器替换为 {any}`auto_scheduler.RPCRunner`。"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.10.9"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
